Results presentation

"Alcohol Effects On Study" dataset

As before, I work with the "Alcohol Effects On Study" dataset. In my first report, I explored the dataset and searched for the best regressors (html version, ipynb version). In the next one, I analyzed explanations of a boosting model with the SHAP method (html version, ipynb version).

This time, I use the same data preprocessing as previously, leaving only nine explanatory variables. I also restrict myself to the "maths" dataset.
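The encoding step of this preprocessing can be sketched on a toy frame (the column names mirror the dataset, but the values are made up): the binary yes/no columns are one-hot encoded with `pd.get_dummies`, and `drop_first=True` keeps a single 0/1 indicator per binary variable.

```python
import pandas as pd

# Toy stand-in for the preprocessed data: one numeric column,
# one binary yes/no column, and the target G3.
toy = pd.DataFrame({
    "studytime": [2, 1, 3],
    "famsup": ["yes", "no", "yes"],
    "G3": [12, 9, 15],
})

# Separate features from the target, then one-hot encode the
# categorical column; drop_first=True leaves only 'famsup_yes'.
X = pd.get_dummies(toy.drop(columns="G3"), columns=["famsup"], drop_first=True)
y = toy["G3"]
```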

Explanations with LIME

In this work, I use two models: a boosting model (XGBRegressor) and a Support Vector Machine regressor (SVR). The XGBRegressor parameters are the same as in the previous work. For a fair comparison, I shuffled the dataset with the same random seed and picked the same examples as in the previous work.

For Pupil 0, the XGBRegressor model predicts grade 12.74. I decomposed the prediction using the LIME method implemented in the dalex library with random seed = 0:

pupil0_0.png
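For intuition, the mechanism behind such a decomposition can be sketched with numpy and scikit-learn alone. This is a simplified illustration, not the dalex or lime implementation: LIME perturbs the instance, queries the black-box model, weights the perturbed samples by proximity, and fits a locally weighted linear surrogate whose coefficients play the role of the variable effects.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def lime_style_effects(predict_fn, x, scale=1.0, n_samples=5000, seed=0):
    """Toy LIME-style decomposition: fit a locally weighted linear
    surrogate around x and return its coefficients as 'effects'."""
    rng = np.random.default_rng(seed)
    # Perturb the explained instance with Gaussian noise.
    X_pert = x + rng.normal(0.0, scale, size=(n_samples, x.shape[0]))
    y_pert = predict_fn(X_pert)
    # Proximity kernel: closer perturbations get larger weights.
    dists = np.linalg.norm(X_pert - x, axis=1)
    weights = np.exp(-(dists ** 2) / (2 * scale ** 2))
    surrogate = LinearRegression().fit(X_pert, y_pert, sample_weight=weights)
    return surrogate.coef_

# Example: for a linear black box the surrogate recovers the true slopes.
black_box = lambda X: 3 * X[:, 0] - 2 * X[:, 1]
effects = lime_style_effects(black_box, np.array([1.0, 2.0]))
```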

Using the lime library, it is also possible to get a differently formatted plot with the same results.

pupil0_0_lime.png

The LIME decompositions are fairly stable. I used a few different random seeds and got similar plots. For example, for random seed = 1 we can observe that the most important variables are the same and have a very similar impact. There is just one difference in the order of the variables: for random seed = 1 the positive contribution of low alcohol consumption on weekdays outweighs the negative contribution of family support, while for random seed = 0 it was the other way around.

pupil0_1.png

For Pupil 314, the model predicts grade 13.23. In this case, the LIME decompositions are less stable, but they still do not differ much across random seeds. Here is the plot for random seed = 1:

pupil314_1.png

For random seed = 2, the five most important variables make similar contributions to the model's prediction. However, this time the negative impact of high alcohol consumption on weekdays is estimated as larger, and the negative impact of not attending paid maths classes as smaller. Interestingly, this time the pupil's weekend alcohol consumption (which is a bit above the mean) is interpreted as having a negative contribution, whereas previously the contribution was estimated as positive.

pupil314_2.png

Comparison with SHAP

As we can see, LIME gives somewhat more detailed explanations than SHAP. Not only does it report the importance of each variable, it also shows the interval the model considers for each variable when deciding how that variable contributes to the prediction.

It is also interesting to compare LIME contributions with Shapley values for the surprising example of Pupil 1, which I discussed in the previous work. Similarly to the SHAP values, the contributions of no failures and no absences are considered very large. However, this time we do not see the shockingly large positive impact of high weekend alcohol consumption that we saw with the SHAP method.

pupil1_0.png

Boosting vs. Support Vector Machine

I trained a support vector regressor and compared the explanations of its predictions with those of the boosting model. Below, I include plots for Pupils 0, 1 and 314.

pupil314_svm.png
pupil1_svm.png
pupil0_svm.png

It can be seen that this model assigns much more importance to the number of past failures. Moreover, the high weekend alcohol consumption of Pupil 0 and Pupil 1 has a significant negative impact on the model's prediction. One could say that such a support vector machine model gives results that may be more interesting to the creators of the dataset, who wanted to find the effects of alcohol on pupils' final grades.
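As a minimal sketch (toy data, illustrative names only), fitting such an SVR with scikit-learn takes a couple of lines. One design point worth noting: the default RBF-kernel SVR is sensitive to feature scaling, so standardizing the inputs first is usually advisable, whereas the tree-based booster is scale-invariant.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Toy stand-in for the nine preprocessed explanatory variables.
X_toy = rng.normal(size=(100, 9))
y_toy = X_toy[:, 0] - 0.5 * X_toy[:, 3] + rng.normal(scale=0.1, size=100)

# Scale, then fit a default RBF-kernel SVR; the appendix fits SVR()
# directly on the one-hot encoded frame, without scaling.
svr = make_pipeline(StandardScaler(), SVR())
svr.fit(X_toy, y_toy)
preds = svr.predict(X_toy)
```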

Appendix

In [18]:
!pip install dalex
!pip install lime
In [106]:
import dalex as dx
import xgboost
import lime
import lime.lime_tabular  # the tabular submodule is not imported by 'import lime' alone

import sklearn
from sklearn.svm import SVR

import pandas as pd
import numpy as np

import random
import matplotlib.pyplot as plt
In [20]:
maths_dataset = pd.read_csv('Maths.csv')
portuguese_dataset = pd.read_csv('Portuguese.csv')
In [21]:
categorical_variables = ['schoolsup', 'famsup', 'paid', 'higher']
numerical_variables = ['studytime', 'failures', 'Dalc', 'Walc', 'absences']
def preprocess_dataset(df):
  new_df = df[['studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'higher', 'Dalc', 'Walc', 'absences', 'G3']]\
           .sample(frac=1, random_state=0).reset_index(drop=True)
  new_df.loc[:, new_df.dtypes == 'object'] = new_df.select_dtypes(['object'])\
                                             .apply(lambda x: x.astype('category'))
  X, y = new_df.drop(columns='G3'), new_df.G3
  X = pd.get_dummies(X, columns=categorical_variables, drop_first=True)
  return X, y
In [22]:
def fit_boost_model(X, y):
  model = xgboost.XGBRegressor(
      n_estimators=500,
      max_depth=3,
      max_leaves=64,
      use_label_encoder=False
  )
  model.fit(X, y)
  return model
In [92]:
def explain_with_dalex(model, X, y, observations=(0, 1, 314)):
  explainer = dx.Explainer(model, X, y)
  for i in observations:
    observation = X.iloc[[i]]
    print(f"Model's prediction for Pupil {i} is {explainer.predict(observation)[0]:.2f}")
    for seed in range(4):
      random.seed(seed)
      np.random.seed(seed)
      explanation = explainer.predict_surrogate(observation)
      print(f"For random seed {seed} the explanation is:\n", explanation.result)
      explanation.plot()
      plt.show()
In [99]:
def explain_with_lime(model, X, y, observations=(0, 1, 314)):
  lime_explainer = lime.lime_tabular.LimeTabularExplainer(
    training_data=X.values,
    feature_names=X.columns,
    mode="regression"
  )
  for i in observations:
    observation = X.values[i]
    lime_explanation = lime_explainer.explain_instance(
      data_row=observation,
      # Even though the explainer is defined with correct feature names,
      # calling model.predict yields a feature_names mismatch error
      # That's why I needed to use validate_features=False
      predict_fn=lambda d: model.predict(d, validate_features=False)
    )
    _ = lime_explanation.as_pyplot_figure()
    _ = lime_explanation.show_in_notebook()
    plt.show()
In [109]:
def explain_svm_model(X, y, observations=(0, 1, 314)):
  svm_ohe = SVR()
  svm_ohe.fit(X, y)
  explainer_svm = dx.Explainer(svm_ohe, X, y, label="SVM", verbose=False)
  for i in observations:
    observation = X.iloc[[i]]
    explanation_svm = explainer_svm.predict_surrogate(observation)
    explanation_svm.plot(return_figure=True)
    _ = plt.title(f'Explaining SVM predicting {np.round(explainer_svm.predict(observation).item(), 4)} for Pupil {i}')
    plt.show()
In [111]:
maths_X, maths_y = preprocess_dataset(maths_dataset)
maths_model = fit_boost_model(maths_X, maths_y)
explain_with_dalex(maths_model, maths_X, maths_y)
explain_with_lime(maths_model, maths_X, maths_y)
explain_svm_model(maths_X, maths_y)
[08:12:10] WARNING: /workspace/src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
Preparation of a new explainer is initiated

  -> data              : 395 rows 9 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 395 values
  -> model_class       : xgboost.sklearn.XGBRegressor (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_default at 0x7fa5d9157680> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = -1.15, mean = 10.4, max = 17.6
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -11.6, mean = -8.83e-05, max = 8.32
  -> model_info        : package xgboost

A new explainer has been created!
Model's prediction for Pupil 0 is 12.74
For random seed 0 the explanation is:
                     variable    effect
0           failures <= 0.00  2.850611
1    0.00 < absences <= 4.00  1.584589
2      schoolsup_yes <= 0.00  0.813356
3  0.00 < famsup_yes <= 1.00 -0.595950
4               Dalc <= 1.00  0.471273
5           studytime > 2.00  0.441088
6        1.00 < Walc <= 2.00 -0.272783
7    0.00 < paid_yes <= 1.00  0.079039
8         higher_yes <= 1.00  0.000000
For random seed 1 the explanation is:
                     variable    effect
0           failures <= 0.00  2.877127
1    0.00 < absences <= 4.00  1.543813
2      schoolsup_yes <= 0.00  0.718054
3               Dalc <= 1.00  0.610487
4  0.00 < famsup_yes <= 1.00 -0.573323
5           studytime > 2.00  0.550216
6        1.00 < Walc <= 2.00 -0.266017
7    0.00 < paid_yes <= 1.00  0.250263
8         higher_yes <= 1.00  0.000000
For random seed 2 the explanation is:
                     variable    effect
0           failures <= 0.00  2.913790
1    0.00 < absences <= 4.00  1.627636
2      schoolsup_yes <= 0.00  0.870581
3               Dalc <= 1.00  0.670115
4  0.00 < famsup_yes <= 1.00 -0.643384
5           studytime > 2.00  0.522842
6        1.00 < Walc <= 2.00 -0.505361
7    0.00 < paid_yes <= 1.00  0.211442
8         higher_yes <= 1.00  0.000000
For random seed 3 the explanation is:
                     variable    effect
0           failures <= 0.00  3.044379
1    0.00 < absences <= 4.00  1.700748
2      schoolsup_yes <= 0.00  0.727498
3               Dalc <= 1.00  0.721781
4  0.00 < famsup_yes <= 1.00 -0.653942
5           studytime > 2.00  0.560341
6        1.00 < Walc <= 2.00 -0.333138
7    0.00 < paid_yes <= 1.00  0.188638
8         higher_yes <= 1.00  0.000000
Model's prediction for Pupil 1 is 11.12
For random seed 0 the explanation is:
                     variable    effect
0           failures <= 0.00  3.021935
1           absences <= 0.00 -2.186546
2      schoolsup_yes <= 0.00  0.758813
3  0.00 < famsup_yes <= 1.00 -0.608354
4        1.00 < Dalc <= 2.00 -0.485566
5           studytime > 2.00  0.406492
6                Walc > 3.00 -0.102370
7    0.00 < paid_yes <= 1.00  0.080729
8         higher_yes <= 1.00  0.000000
For random seed 1 the explanation is:
                     variable    effect
0           failures <= 0.00  3.064010
1           absences <= 0.00 -2.202886
2        1.00 < Dalc <= 2.00 -0.690352
3      schoolsup_yes <= 0.00  0.540144
4  0.00 < famsup_yes <= 1.00 -0.516346
5           studytime > 2.00  0.514332
6    0.00 < paid_yes <= 1.00  0.299189
7                Walc > 3.00 -0.112604
8         higher_yes <= 1.00  0.000000
For random seed 2 the explanation is:
                     variable    effect
0           failures <= 0.00  3.103689
1           absences <= 0.00 -2.451086
2      schoolsup_yes <= 0.00  0.821322
3  0.00 < famsup_yes <= 1.00 -0.620742
4        1.00 < Dalc <= 2.00 -0.607279
5           studytime > 2.00  0.438333
6                Walc > 3.00 -0.267789
7    0.00 < paid_yes <= 1.00  0.184362
8         higher_yes <= 1.00  0.000000
For random seed 3 the explanation is:
                     variable    effect
0           failures <= 0.00  3.178143
1           absences <= 0.00 -2.332528
2      schoolsup_yes <= 0.00  0.635974
3  0.00 < famsup_yes <= 1.00 -0.609122
4        1.00 < Dalc <= 2.00 -0.592815
5           studytime > 2.00  0.544524
6                Walc > 3.00 -0.243277
7    0.00 < paid_yes <= 1.00  0.190861
8         higher_yes <= 1.00  0.000000
Model's prediction for Pupil 314 is 13.23
For random seed 0 the explanation is:
                   variable    effect
0         failures <= 0.00  2.952302
1  0.00 < absences <= 4.00  1.604575
2        studytime <= 1.00  0.862743
3    schoolsup_yes <= 0.00  0.740893
4       famsup_yes <= 0.00  0.653482
5              Dalc > 2.00 -0.405154
6         paid_yes <= 0.00 -0.085417
7      2.00 < Walc <= 3.00  0.073486
8       higher_yes <= 1.00  0.000000
For random seed 1 the explanation is:
                   variable    effect
0         failures <= 0.00  2.910717
1  0.00 < absences <= 4.00  1.611754
2        studytime <= 1.00  0.960294
3    schoolsup_yes <= 0.00  0.657108
4       famsup_yes <= 0.00  0.582400
5         paid_yes <= 0.00 -0.273020
6              Dalc > 2.00 -0.266574
7      2.00 < Walc <= 3.00  0.049339
8       higher_yes <= 1.00  0.000000
For random seed 2 the explanation is:
                   variable    effect
0         failures <= 0.00  2.933525
1  0.00 < absences <= 4.00  1.642675
2        studytime <= 1.00  0.875741
3    schoolsup_yes <= 0.00  0.801639
4       famsup_yes <= 0.00  0.682566
5              Dalc > 2.00 -0.400018
6         paid_yes <= 0.00 -0.219348
7      2.00 < Walc <= 3.00 -0.128272
8       higher_yes <= 1.00  0.000000
For random seed 3 the explanation is:
                   variable    effect
0         failures <= 0.00  3.095380
1  0.00 < absences <= 4.00  1.717284
2        studytime <= 1.00  0.986883
3       famsup_yes <= 0.00  0.680981
4              Dalc > 2.00 -0.674127
5    schoolsup_yes <= 0.00  0.630005
6         paid_yes <= 0.00 -0.185515
7      2.00 < Walc <= 3.00  0.004938
8       higher_yes <= 1.00  0.000000
/usr/local/lib/python3.7/dist-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but SVR was fitted with feature names
  "X does not have valid feature names, but"